New York City is one of the world's most populous megacities. It has been described as the cultural, financial, and media capital of the world, and it exerts significant influence on commerce, entertainment, research, technology, education, politics, tourism, dining, art, fashion, and sports. Often called the most photographed city in the world, NYC attracts people from a wide range of cultures and income groups.
A city this large inevitably becomes an epicenter of criminal activity. Using statistical analysis, we will look for trends in crime and predict crime rates with spatial-temporal analytics.
We use the NYPD Complaint Data Historic dataset from NYC Open Data. It includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 through the end of 2020.
Importing the libraries we require
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import geopandas
from geopandas import GeoDataFrame
from geopandas import points_from_xy
from geopandas import read_file as gp_readfile
import geoplot
from prophet import Prophet
We create a DataFrame from the CSV using pandas; all of our analysis will be based on it.
df = pd.read_csv("./NYPD_Complaint_Data_Historic.csv", low_memory=False)
A dataset this large may introduce noise and bias our results with outdated patterns. For example, a change in police funding may influence crime patterns in recent years, so we create a subset of the data and analyze crime from 2017 onward.
df['REPORTED_DATE'] = pd.to_datetime(df['RPT_DT'], format='%m/%d/%Y', errors='coerce')
df['TIME'] = pd.to_datetime(df.CMPLNT_FR_TM, errors='coerce').dt.hour
df['YEAR'] = df.REPORTED_DATE.dt.year
df['MONTH'] = df.REPORTED_DATE.dt.month
df = df[df.YEAR.gt(2016)]
Next, we convert the DataFrame to a GeoDataFrame for spatial analysis.
geometry = points_from_xy(df['Longitude'], df['Latitude'], crs="EPSG:4326")
df = GeoDataFrame(df, geometry=geometry)
Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process.
We start by taking a look at the shape of the dataset to understand what kind of data we are dealing with.
df.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1808176 entries, 0 to 7375992
Data columns (total 40 columns):
 #   Column             Dtype
---  ------             -----
 0   CMPLNT_NUM         int64
 1   CMPLNT_FR_DT       object
 2   CMPLNT_FR_TM       object
 3   CMPLNT_TO_DT       object
 4   CMPLNT_TO_TM       object
 5   ADDR_PCT_CD        float64
 6   RPT_DT             object
 7   KY_CD              int64
 8   OFNS_DESC          object
 9   PD_CD              float64
 10  PD_DESC            object
 11  CRM_ATPT_CPTD_CD   object
 12  LAW_CAT_CD         object
 13  BORO_NM            object
 14  LOC_OF_OCCUR_DESC  object
 15  PREM_TYP_DESC      object
 16  JURIS_DESC         object
 17  JURISDICTION_CODE  float64
 18  PARKS_NM           object
 19  HADEVELOPT         object
 20  HOUSING_PSA        object
 21  X_COORD_CD         float64
 22  Y_COORD_CD         float64
 23  SUSP_AGE_GROUP     object
 24  SUSP_RACE          object
 25  SUSP_SEX           object
 26  TRANSIT_DISTRICT   float64
 27  Latitude           float64
 28  Longitude          float64
 29  Lat_Lon            object
 30  PATROL_BORO        object
 31  STATION_NAME       object
 32  VIC_AGE_GROUP      object
 33  VIC_RACE           object
 34  VIC_SEX            object
 35  REPORTED_DATE      datetime64[ns]
 36  TIME               float64
 37  YEAR               int64
 38  MONTH              int64
 39  geometry           geometry
dtypes: datetime64[ns](1), float64(9), geometry(1), int64(4), object(25)
memory usage: 565.6+ MB
Let's look at the number of entries and columns we have.
df.shape
(1808176, 40)
percent_missing = round(df.isna().sum() / len(df) * 100)
pd.DataFrame({'column_name': df.columns,
'percent_missing': percent_missing})
| | column_name | percent_missing |
|---|---|---|
| CMPLNT_NUM | CMPLNT_NUM | 0.0 |
| CMPLNT_FR_DT | CMPLNT_FR_DT | 0.0 |
| CMPLNT_FR_TM | CMPLNT_FR_TM | 0.0 |
| CMPLNT_TO_DT | CMPLNT_TO_DT | 13.0 |
| CMPLNT_TO_TM | CMPLNT_TO_TM | 13.0 |
| ADDR_PCT_CD | ADDR_PCT_CD | 0.0 |
| RPT_DT | RPT_DT | 0.0 |
| KY_CD | KY_CD | 0.0 |
| OFNS_DESC | OFNS_DESC | 0.0 |
| PD_CD | PD_CD | 0.0 |
| PD_DESC | PD_DESC | 0.0 |
| CRM_ATPT_CPTD_CD | CRM_ATPT_CPTD_CD | 0.0 |
| LAW_CAT_CD | LAW_CAT_CD | 0.0 |
| BORO_NM | BORO_NM | 0.0 |
| LOC_OF_OCCUR_DESC | LOC_OF_OCCUR_DESC | 18.0 |
| PREM_TYP_DESC | PREM_TYP_DESC | 0.0 |
| JURIS_DESC | JURIS_DESC | 0.0 |
| JURISDICTION_CODE | JURISDICTION_CODE | 0.0 |
| PARKS_NM | PARKS_NM | 99.0 |
| HADEVELOPT | HADEVELOPT | 96.0 |
| HOUSING_PSA | HOUSING_PSA | 93.0 |
| X_COORD_CD | X_COORD_CD | 0.0 |
| Y_COORD_CD | Y_COORD_CD | 0.0 |
| SUSP_AGE_GROUP | SUSP_AGE_GROUP | 25.0 |
| SUSP_RACE | SUSP_RACE | 25.0 |
| SUSP_SEX | SUSP_SEX | 25.0 |
| TRANSIT_DISTRICT | TRANSIT_DISTRICT | 98.0 |
| Latitude | Latitude | 0.0 |
| Longitude | Longitude | 0.0 |
| Lat_Lon | Lat_Lon | 0.0 |
| PATROL_BORO | PATROL_BORO | 0.0 |
| STATION_NAME | STATION_NAME | 98.0 |
| VIC_AGE_GROUP | VIC_AGE_GROUP | 0.0 |
| VIC_RACE | VIC_RACE | 0.0 |
| VIC_SEX | VIC_SEX | 0.0 |
| REPORTED_DATE | REPORTED_DATE | 0.0 |
| TIME | TIME | 0.0 |
| YEAR | YEAR | 0.0 |
| MONTH | MONTH | 0.0 |
| geometry | geometry | 0.0 |
Since the dataset is so large (2.19 GB on disk), we drop rows with invalid or empty values in the essential columns, and drop columns that are almost entirely empty, to make the DataFrame lighter to process. For the metrics that contribute to our analysis, we instead substitute invalid or empty values with UNKNOWN.
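As a minimal sketch of the two strategies (on a hypothetical three-row frame, not the real dataset): drop rows where an essential field is missing, but fill optional fields with a placeholder.

```python
import pandas as pd

# Hypothetical toy frame: one row missing a coordinate, one missing victim race
toy = pd.DataFrame({
    'Latitude': [40.82, None, 40.74],
    'VIC_RACE': ['WHITE', 'BLACK', None],
})

# Coordinates are essential for spatial analysis -> drop the row entirely
toy = toy.dropna(subset=['Latitude'])

# Victim race is useful but optional -> keep the row, label it UNKNOWN
toy = toy.fillna({'VIC_RACE': 'UNKNOWN'})

print(len(toy))                    # 2
print(toy['VIC_RACE'].tolist())    # ['WHITE', 'UNKNOWN']
```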
df.dropna(subset=['Y_COORD_CD','X_COORD_CD','Latitude','Longitude','CRM_ATPT_CPTD_CD','CMPLNT_FR_TM','Lat_Lon','CMPLNT_FR_DT','BORO_NM','OFNS_DESC','ADDR_PCT_CD'], inplace=True)
df.drop(['PARKS_NM','STATION_NAME','TRANSIT_DISTRICT','HADEVELOPT','HOUSING_PSA'],axis='columns', inplace=True)
df.drop(['JURISDICTION_CODE'], axis='columns', inplace=True)
df.drop(['PD_CD','PD_DESC','PATROL_BORO','CMPLNT_TO_DT','CMPLNT_TO_TM'], axis='columns', inplace=True)
df.fillna({'LOC_OF_OCCUR_DESC': 'UNKNOWN',
           'VIC_RACE': 'UNKNOWN',
           'VIC_AGE_GROUP': 'UNKNOWN',
           'VIC_SEX': 'UNKNOWN',
           'SUSP_RACE': 'UNKNOWN',
           'SUSP_AGE_GROUP': 'UNKNOWN',
           'SUSP_SEX': 'UNKNOWN'}, inplace=True)
df["OFNS_DESC"].unique()
array(['DANGEROUS WEAPONS', 'FORGERY', 'HARRASSMENT 2',
'MISCELLANEOUS PENAL LAW', 'BURGLARY', 'DANGEROUS DRUGS',
'PETIT LARCENY', 'OFF. AGNST PUB ORD SENSBLTY &', 'GRAND LARCENY',
'FELONY ASSAULT', 'ASSAULT 3 & RELATED OFFENSES', 'ARSON', 'RAPE',
'SEX CRIMES', 'GRAND LARCENY OF MOTOR VEHICLE', 'ROBBERY',
'CRIMINAL MISCHIEF & RELATED OF', 'THEFT-FRAUD',
'VEHICLE AND TRAFFIC LAWS', 'CRIMINAL TRESPASS',
'OFFENSES INVOLVING FRAUD', 'FRAUDS',
'OFFENSES AGAINST PUBLIC ADMINI', 'OFFENSES AGAINST THE PERSON',
'ADMINISTRATIVE CODE', 'INTOXICATED & IMPAIRED DRIVING',
'ESCAPE 3', 'NYS LAWS-UNCLASSIFIED FELONY',
'POSSESSION OF STOLEN PROPERTY', 'THEFT OF SERVICES',
'KIDNAPPING & RELATED OFFENSES', 'OTHER OFFENSES RELATED TO THEF',
'UNAUTHORIZED USE OF A VEHICLE', "BURGLAR'S TOOLS",
'ENDAN WELFARE INCOMP', 'FRAUDULENT ACCOSTING',
'AGRICULTURE & MRKTS LAW-UNCLASSIFIED',
'OTHER STATE LAWS (NON PENAL LA', 'OFFENSES AGAINST PUBLIC SAFETY',
'GAMBLING', 'PETIT LARCENY OF MOTOR VEHICLE',
'ALCOHOLIC BEVERAGE CONTROL LAW', 'OFFENSES RELATED TO CHILDREN',
'ANTICIPATORY OFFENSES', 'LOITERING/GAMBLING (CARDS, DIC',
'FELONY SEX CRIMES', 'HOMICIDE-NEGLIGENT,UNCLASSIFIE',
'PROSTITUTION & RELATED OFFENSES', 'JOSTLING',
'CHILD ABANDONMENT/NON SUPPORT', 'OTHER STATE LAWS', 'KIDNAPPING',
'NYS LAWS-UNCLASSIFIED VIOLATION', 'DISORDERLY CONDUCT',
'DISRUPTION OF A RELIGIOUS SERV', 'OFFENSES AGAINST MARRIAGE UNCL',
'HOMICIDE-NEGLIGENT-VEHICLE', 'INTOXICATED/IMPAIRED DRIVING',
'KIDNAPPING AND RELATED OFFENSES',
'UNLAWFUL POSS. WEAP. ON SCHOOL', 'OTHER TRAFFIC INFRACTION',
'OTHER STATE LAWS (NON PENAL LAW)', 'FORTUNE TELLING', 'LOITERING',
'NEW YORK CITY HEALTH CODE', 'ABORTION'], dtype=object)
The offense descriptions contain some inconsistencies, so we correct them; for example, "ASSAULT 3 & RELATED OFFENSES" becomes "ASSAULT & RELATED OFFENSES", and truncated labels are expanded. We also replace the single-letter sex codes (M, F, D, E) with readable labels.
df_clean = df.replace({'HARRASSMENT 2': 'HARASSMENT',
'ESCAPE 3': 'ESCAPE',
'ASSAULT 3 & RELATED OFFENSES': 'ASSAULT & RELATED OFFENSES',
'CRIMINAL MISCHIEF & RELATED OF': 'CRIMINAL MISCHIEF',
'OFF. AGNST PUB ORD SENSBLTY &': 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION',
'OTHER STATE LAWS (NON PENAL LA': 'OTHER STATE LAWS (NON PENAL LAW)',
'ENDAN WELFARE INCOMP': 'ENDANGERING WELFARE OF INCOMPETENT',
'AGRICULTURE & MRKTS LAW-UNCLASSIFIED': 'AGRICULTURE & MARKETS LAW',
'DISRUPTION OF A RELIGIOUS SERV': 'DISRUPTION OF A RELIGIOUS SERVICE',
'LOITERING/GAMBLING (CARDS, DIC': 'GAMBLING',
'OFFENSES AGAINST MARRIAGE UNCL': 'OFFENSES AGAINST MARRIAGE',
'HOMICIDE-NEGLIGENT,UNCLASSIFIE': 'HOMICIDE-NEGLIGENT',
'E': 'UNKNOWN',
'D': 'BUSINESS/ORGANIZATION',
'F': 'FEMALE',
'M': 'MALE'})
df_clean['TIME'] = df_clean['TIME'].astype('int64')
df_clean['ADDR_PCT_CD'] = df_clean['ADDR_PCT_CD'].astype('int64')
df_clean.head()
| CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | ADDR_PCT_CD | RPT_DT | KY_CD | OFNS_DESC | CRM_ATPT_CPTD_CD | LAW_CAT_CD | BORO_NM | ... | Longitude | Lat_Lon | VIC_AGE_GROUP | VIC_RACE | VIC_SEX | REPORTED_DATE | TIME | YEAR | MONTH | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 394506329 | 12/31/2019 | 17:30:00 | 32 | 12/31/2019 | 118 | DANGEROUS WEAPONS | COMPLETED | FELONY | MANHATTAN | ... | -73.943324 | (40.82092679700002, -73.94332421899996) | UNKNOWN | UNKNOWN | UNKNOWN | 2019-12-31 | 17 | 2019 | 12 | POINT (-73.94332 40.82093) |
| 1 | 968873685 | 12/29/2019 | 16:31:00 | 47 | 12/29/2019 | 113 | FORGERY | COMPLETED | FELONY | BRONX | ... | -73.861640 | (40.885701406000074, -73.86164032499995) | UNKNOWN | UNKNOWN | UNKNOWN | 2019-12-29 | 16 | 2019 | 12 | POINT (-73.86164 40.88570) |
| 2 | 509837549 | 12/15/2019 | 18:45:00 | 109 | 12/29/2019 | 578 | HARASSMENT | COMPLETED | VIOLATION | QUEENS | ... | -73.819824 | (40.74228115600005, -73.81982408) | 25-44 | WHITE HISPANIC | FEMALE | 2019-12-29 | 18 | 2019 | 12 | POINT (-73.81982 40.74228) |
| 3 | 352454313 | 12/28/2019 | 01:00:00 | 47 | 12/28/2019 | 126 | MISCELLANEOUS PENAL LAW | COMPLETED | FELONY | BRONX | ... | -73.847545 | (40.87531145100007, -73.84754521099995) | UNKNOWN | UNKNOWN | UNKNOWN | 2019-12-28 | 1 | 2019 | 12 | POINT (-73.84755 40.87531) |
| 5 | 293718737 | 12/27/2019 | 22:00:00 | 9 | 12/27/2019 | 107 | BURGLARY | ATTEMPTED | FELONY | MANHATTAN | ... | -73.980466 | (40.72075882100006, -73.98046642299995) | UNKNOWN | UNKNOWN | MALE | 2019-12-27 | 22 | 2019 | 12 | POINT (-73.98047 40.72076) |
5 rows × 29 columns
df_clean.OFNS_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Types of Crimes", figsize=(20,10), color = sns.color_palette("gist_rainbow_r"))
<AxesSubplot:title={'center':'Types of Crimes'}>
Based on the chart above, the most common crime is PETIT LARCENY.
The most common form of Petit Larceny is shoplifting. This charge will apply when a person takes items from a store (unless the items are worth more than $1,000). A person can be charged with this crime even if they don’t leave the store with the items. For example, if someone places an item into their pocket or bag, they might be charged on the ground that they were concealing the item and intending to steal it.
Petit Larceny does not only apply to shoplifting. People have been charged with this crime in New York State for using a doctored MetroCard at subway turnstiles, for removing a landlord’s surveillance cameras from a rental property, and for taking mail from a mailbox. Petit Larceny can also be charged when someone possesses another person’s property and refuses to return it—for example, if an acquaintance loans you a cell phone and you walk off with it.
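The $1,000 line described above can be written as a one-line rule. This is only an illustrative sketch of the statutory threshold; `classify_larceny` is a hypothetical helper, not part of our pipeline.

```python
def classify_larceny(value_stolen: float) -> str:
    # New York draws the line at $1,000: property worth $1,000 or less
    # is petit larceny; anything above that is grand larceny.
    return "PETIT LARCENY" if value_stolen <= 1000 else "GRAND LARCENY"

print(classify_larceny(250))    # PETIT LARCENY (typical shoplifting)
print(classify_larceny(5000))   # GRAND LARCENY
```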
Felonies are the most serious kinds of crimes. Generally, a crime is considered a felony when it is punishable by more than a year in a state prison (also called a penitentiary). Examples of felonies are murder, rape, burglary, and the sale of illegal drugs.
Misdemeanors are less serious crimes, and are typically punishable by up to a year in county jail. Common misdemeanors include shoplifting, drunk driving, assault, and possession of an unregistered firearm. Often, an offense that is a misdemeanor the first time a person commits it becomes a felony the second time around.
Violations are even less serious offenses, like those involving traffic laws, which typically subject a person to nothing more than a monetary fine. Defendants charged with infractions usually have no right to a jury trial or a court-appointed lawyer. But repeat offenders, even when the offense is a mere infraction, may face stiffer penalties or charges. (Some states consider certain kinds of infractions, like traffic tickets, to be civil rather than criminal offenses.)
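The three levels can be summarized as a rough mapping from maximum punishment to offense level; `offense_level` is a hypothetical helper that just encodes the definitions above.

```python
def offense_level(max_jail_days: int) -> str:
    # felony      -> punishable by more than a year in state prison
    # misdemeanor -> up to a year in county jail
    # violation   -> no jail time, typically just a monetary fine
    if max_jail_days > 365:
        return "FELONY"
    if max_jail_days > 0:
        return "MISDEMEANOR"
    return "VIOLATION"

print(offense_level(10 * 365))  # FELONY
print(offense_level(180))       # MISDEMEANOR
print(offense_level(0))         # VIOLATION
```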
df_clean.LAW_CAT_CD.value_counts().plot(kind='pie', figsize=(15,10), colors=sns.color_palette("cool"), legend=True, autopct='%1.2f%%', explode=(0, 0, 0.20), shadow=False, startangle=0, title="Levels of law")
<AxesSubplot:title={'center':'Levels of law'}, ylabel='LAW_CAT_CD'>
data_vic_susp_race = df_clean[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"])
ax = data_vic_susp_race.plot(kind="barh", color =sns.color_palette("twilight_shifted_r"), title = 'Racial analysis', figsize=(15,10))
ax.legend(["Victim Race", "Suspect Race"])
<matplotlib.legend.Legend at 0x196b6384d00>
Based on the data, Black individuals appear most frequently in the complaints, both as victims and as suspects.
data_vic_susp_age = df_clean[['VIC_AGE_GROUP', 'SUSP_AGE_GROUP']].apply(pd.Series.value_counts).reindex(index = ["<18", "18-24", "25-44", "45-64", "65+"])
ax = data_vic_susp_age.plot(kind="barh", color = sns.color_palette("terrain"), title = 'Age analysis', figsize=(15,10))
ax.legend(["Victim Age", "Suspect age"])
<matplotlib.legend.Legend at 0x1972699b490>
We see that both victims and suspects most often fall in the 25-44 age group.
data_vic_susp_sex = df_clean[['VIC_SEX', 'SUSP_SEX']].apply(pd.Series.value_counts).reindex(index = ["MALE", "FEMALE"])
ax = data_vic_susp_sex.plot(kind="barh", color = sns.color_palette("magma_r"), title = 'Gender analysis', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
<matplotlib.legend.Legend at 0x196bfa560a0>
With the gender analysis in place, we notice a very large number of female victims. Female victims disproportionately face sex crimes, so let's filter by sex crimes to gain more insight.
sex_crimes_filtered = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False)]
ax = sex_crimes_filtered[['VIC_SEX', 'SUSP_SEX']].apply(pd.Series.value_counts).reindex(index = ["MALE", "FEMALE"]).plot(kind="barh", color = sns.color_palette("rocket_r"), title = 'Sex crime analysis', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
<matplotlib.legend.Legend at 0x196f37c8ee0>
This graph confirms our intuition about sexual crimes: females make up the largest share of victims, and males make up the largest share of suspects.
sex_crimes_filtered_female = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False) & df_clean.VIC_SEX.eq('FEMALE')]
sex_crimes_filtered_vic_age = sex_crimes_filtered_female[['VIC_AGE_GROUP']].apply(pd.Series.value_counts).reindex(index = ["<18", "18-24", "25-44", "45-64", "65+"])
ax = sex_crimes_filtered_vic_age.plot(kind="bar", color = sns.color_palette("cool_r"), title = 'Sex crimes victims by age analysis', figsize=(15,10), legend=False)
sex_crimes_filtered_female.PREM_TYP_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Analyze top 10 places of female sex crimes", figsize=(20,10), color = sns.color_palette("cool"))
<AxesSubplot:title={'center':'Analyze top 10 places of female sex crimes'}>
We see that the majority of sex crimes happen in or around the victim's place of residence.
sex_crimes_filtered_female_minors = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False) & df_clean.VIC_SEX.eq('FEMALE') & df_clean.VIC_AGE_GROUP.eq('<18')]
sex_crimes_filtered_female_minors.PREM_TYP_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Analyze top 10 places of female sex crimes of minors", figsize=(20,10), color = sns.color_palette("cool_r"))
<AxesSubplot:title={'center':'Analyze top 10 places of female sex crimes of minors'}>
This follows a similar pattern to that of the female group at large.
ax = sex_crimes_filtered[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"]).plot(kind="barh", color = sns.color_palette("magma"), title = 'Racially classified Sex crime analysis', figsize=(15,10))
ax.legend(["Victim Race", "Suspect Race"])
<matplotlib.legend.Legend at 0x19695e92fd0>
ax = sex_crimes_filtered_female_minors[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"]).plot(kind="barh", color = sns.color_palette("magma_r"), title = 'Racially classified Sex crime analysis (minors)', figsize=(15,10))
ax.legend(["Victim Race", "Suspect Race"])
<matplotlib.legend.Legend at 0x196bf8c9e20>
But is our intuition right that women are especially prone to sex crimes? Let's look at the most common crimes against female victims overall.
df_clean[df_clean.VIC_SEX.eq('FEMALE')].OFNS_DESC.value_counts().iloc[:20].sort_values().plot(kind="barh", title = "Types of Crimes", figsize=(20,10), color = sns.color_palette("mako"))
<AxesSubplot:title={'center':'Types of Crimes'}>
Looking at the data by year, the overall crime rate has been declining; the sharp drop in 2020 is likely due in part to the onset of COVID-19, when more people stayed home and self-isolated.
df_clean.groupby('YEAR').size().plot(kind="line", title = "Total Crime Events by Year", figsize=(15,10), color = "coral")
<AxesSubplot:title={'center':'Total Crime Events by Year'}, xlabel='YEAR'>
df_clean.groupby('MONTH').size().plot(kind = 'bar', title ='Total Crime Events by Month', figsize=(15,10), color = sns.color_palette("cool_r") ,rot=0)
<AxesSubplot:title={'center':'Total Crime Events by Month'}, xlabel='MONTH'>
We notice the crime rate is a bit higher during the summer months.
Homes may be more tempting to criminals because windows and doors are left open more frequently, and homeowners often spend less time at home when the weather is pleasant. Many criminals are opportunists, and opportunities present themselves more frequently during the summer months. According to studies, the reason may actually be quite simple. As temperatures rise, many people are generally uncomfortable. This discomfort can give rise to aggression which could lead to aggressive criminal activity.
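One way to quantify that summer lift is to compare the average monthly count for June-August against the rest of the year. A sketch with hypothetical monthly totals (not the real figures):

```python
# Hypothetical complaint totals for Jan..Dec
monthly = [100, 95, 105, 110, 118, 130, 135, 132, 120, 112, 104, 98]

summer = [monthly[m] for m in (5, 6, 7)]                  # Jun, Jul, Aug
rest = [c for i, c in enumerate(monthly) if i not in (5, 6, 7)]

summer_avg = sum(summer) / len(summer)
rest_avg = sum(rest) / len(rest)
lift = summer_avg / rest_avg                              # >1 means a summer lift

print(round(lift, 2))  # 1.24
```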
df_clean.groupby('TIME').size().plot(kind = 'bar', title ='Total Crime Events by Hour of Day', figsize=(15,10), color = sns.color_palette("viridis"), xlabel = 'hours',rot=0)
<AxesSubplot:title={'center':'Total Crime Events by Hour of Day'}, xlabel='hours'>
We can see that crime is higher around the evening and around lunchtime, whereas we typically assume crime peaks at night. This trend is possibly because these are commute hours, when the most people are out and about.
sex_crimes_filtered.groupby('TIME').size().plot(kind = 'bar', title ='Total Sex Crime Events by Hour of Day', figsize=(15,10), color = sns.color_palette("spring_r"), xlabel = 'hours',rot=0)
<AxesSubplot:title={'center':'Total Sex Crime Events by Hour of Day'}, xlabel='hours'>
From the data we can see that the early morning hours are the safest for women, while midnight is when most sex crimes happen.
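Pulling the safest and riskiest hours out of an hourly count series is a one-liner each; a sketch with hypothetical hourly counts shaped like the chart above:

```python
# Hypothetical sex-crime complaint counts by hour of day (subset of 0-23)
hourly = {0: 90, 3: 25, 6: 15, 9: 30, 12: 55, 15: 60, 18: 70, 21: 80}

safest_hour = min(hourly, key=hourly.get)     # hour with the fewest complaints
riskiest_hour = max(hourly, key=hourly.get)   # hour with the most complaints

print(safest_hour)    # 6  (early morning)
print(riskiest_hour)  # 0  (midnight)
```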
df_clean['BORO_NM'].value_counts().sort_values().plot(kind="barh", color = sns.color_palette("seismic_r"), title = 'Crime by Borough', figsize=(15,10))
<AxesSubplot:title={'center':'Crime by Borough'}>
This figure gives us a quick picture of how safe the different boroughs are, but the NYPD is not organized by borough; it is organized into precincts. Digging down to the precinct level could help us predict crime rates better.
# import the precincts data file to overlay on maps
pcints = gp_readfile("./Police Precincts.geojson").to_crs(epsg=4326)
pcints = pcints.rename(columns = {"precinct": "ADDR_PCT_CD"})
pcints.ADDR_PCT_CD = pcints.ADDR_PCT_CD.astype('int64')
# aggregate crime counts per precinct and join them onto the precinct shapes
sx3 = df_clean.groupby("ADDR_PCT_CD").ADDR_PCT_CD.count().to_frame()
c = pcints.join(sx3, on="ADDR_PCT_CD", how="left", lsuffix="ptable_")
c = c.dropna()
c.explore(
    column="ADDR_PCT_CD",            # choropleth colored by complaint count per precinct
    tooltip="ADDR_PCT_CD",           # show the count in a tooltip (on hover)
    popup=True,                      # show all values in a popup (on click)
    tiles="CartoDB positron",        # use "CartoDB positron" tiles
    cmap="Reds",                     # use the "Reds" matplotlib colormap
    style_kwds=dict(color="black"),  # black precinct outlines
    marker_kwds=dict(radius=10, fill=True)
)
So far we have used the data for an analytical overview. We see from the map above that each precinct has a different crime rate. Predicting crime for the city as a whole is not very useful, since each precinct may require a different response at each level.
We also learned from the temporal analysis that crime is seasonal. The dataset serves as a time series, giving us a temporal dimension, and its geographic fields let us evaluate each precinct separately and forecast into the future.
Crimes can be divided into two types: opportunistic and planned. Planned crimes like burglary, grand theft auto, and grand larceny follow seasonal patterns; they are meticulously planned for a successful outcome and follow a specific trend.
Opportunistic crimes, like robbery, happen whenever a perpetrator sees an opportunity to commit a crime.
We will use the crime data from 2017 to 2019 to predict the crimes of 2020; the 2020 data in the dataset serves as our test set.
We will forecast both burglaries and robberies, since burglary is a planned crime whereas robbery is an opportunistic one. This gives us a base model we can reuse for other combinations of crimes.
Why are we mixing opportunistic and planned crime in the same analysis? Because crime follows seasonality, as we saw in our temporal analysis. While the success rates of the two types may differ, both follow a pattern.
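The temporal split described here (2017-2019 to fit, 2020 held out for scoring) looks like this on a toy monthly series; the data below is synthetic, only the split logic matters:

```python
import pandas as pd

# Synthetic monthly series spanning 2017-2020
idx = pd.date_range("2017-01-01", "2020-12-01", freq="MS")
series = pd.DataFrame({"ds": idx, "y": range(len(idx))})

train = series[series.ds.dt.year < 2020]    # 2017-2019 -> fit the model
test = series[series.ds.dt.year == 2020]    # 2020 -> score the forecast

print(len(train), len(test))  # 36 12
```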
# training data split: burglary/grand larceny/robbery complaints in precinct 75, 2017-2019
df_burglary_train = (
    df_clean[df_clean.ADDR_PCT_CD.eq(75)
             & df_clean.OFNS_DESC.str.contains("BURGLARY|GRAND|ROBBERY")
             & df_clean.YEAR.lt(2020)]
    .groupby(pd.Grouper(key='REPORTED_DATE', freq='MS'))
    .agg(y=('REPORTED_DATE', 'count'))  # complaints per month
    .reset_index()
    .rename(columns={"REPORTED_DATE": "ds"})
)
We will be using Facebook's Prophet library for our forecast.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
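Concretely, the additive model Prophet fits is roughly y(t) = g(t) + s(t) + h(t) + ε(t): trend plus seasonality plus holiday effects plus noise. A synthetic sketch of that decomposition (these components are made up for illustration, not Prophet internals):

```python
import numpy as np

t = np.arange(36)                              # 36 months of observations
trend = 100 - 0.5 * t                          # g(t): slowly declining trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)     # s(t): yearly seasonality
noise = np.zeros_like(trend)                   # ε(t): zero here for clarity

y = trend + seasonal + noise                   # the additive model

print(np.allclose(y, trend + seasonal))  # True
```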
Prophet is used in many applications across Facebook for producing reliable forecasts for planning and goal setting.
Our use case fits this tool perfectly!
m = Prophet(changepoint_range = 0.5, yearly_seasonality=True, changepoint_prior_scale=0.16).fit(df_burglary_train)
future = m.make_future_dataframe(periods=12, freq='MS')
fcst = m.predict(future)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 17.
fig = m.plot(fcst, figsize=(15, 10))
While we have our predictions, we still need to assess their accuracy, so we use our test data to score them.
from sklearn.metrics import mean_absolute_error
# Split the test data (2020) from the main data using the same filter
df_burglary_test = (
    df_clean[df_clean.ADDR_PCT_CD.eq(75)
             & df_clean.OFNS_DESC.str.contains("BURGLARY|GRAND|ROBBERY")
             & df_clean.YEAR.gt(2019)]
    .groupby(pd.Grouper(key='REPORTED_DATE', freq='MS'))
    .agg(y=('REPORTED_DATE', 'count'))  # complaints per month
    .reset_index()
    .rename(columns={"REPORTED_DATE": "ds"})
)
# find Mean Absolute Error and plot the graph
y_true = df_burglary_test['y'].values
y_pred = fcst['yhat'][-12:].values
mae = mean_absolute_error(y_true, y_pred)
print('MAE: %.3f' % mae)
plt.figure(figsize=(15,10))
plt.plot(y_true, label='Actual')
plt.plot(y_pred, label='Predicted')
plt.legend()
plt.set_cmap(sns.color_palette("twilight_shifted", as_cmap=True))
plt.show()
MAE: 0.215
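MAE is simply the average absolute difference between actual and predicted values; a quick hand check on toy numbers (not our forecast):

```python
# Toy actual/predicted values to verify the MAE formula by hand
actual = [10, 12, 14]
predicted = [11, 12, 12]

mae_check = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae_check)  # 1.0 -> (|10-11| + |12-12| + |14-12|) / 3
```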
Crime shows an interesting pattern when viewed through a spatial-temporal lens. Is it a marker we can rely on? The patterns seem consistent over the years, which lets us forecast with reasonable accuracy.
Crime trends are influenced by various external factors, though the patterns remain similar. Trends help us prepare, while patterns help us understand.
A few decades ago cars ran on leaded fuel, and at the peak of its usage the lead caused widespread problems in brain development; crime rates rose through the mid-80s. Once governments began phasing out leaded fuel, crime trends dropped. This lead-crime hypothesis has been studied by several environmental research groups.
Our analysis dove into crime patterns in NYC; we surfaced various insights and forecasted crime patterns.
Our forecast focuses on spatial-temporal dimensions and seasonality. It still needs to factor in more markers, such as local education and police funding.